A Language Model Approach to Spam Filtering
Abstract
We present a classification model for semi-structured documents, based on statistical language modelling theory, which outperforms extant approaches to spam filtering on the LingSpam email corpus [1]. We also introduce two variants of a novel discounting technique for higher-order N-gram language models, developed in light of the spam filtering problem.
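As a rough illustration of the general idea of classifying with language models, the sketch below trains one bigram model per class and assigns a message to the class whose model gives it the highest prior-weighted likelihood. This is not the model described in the abstract: it uses plain add-one (Laplace) smoothing in place of the paper's discounting technique, treats each message as a flat token sequence rather than a semi-structured document, and the names `BigramLM`, `classify`, and the toy training data are all hypothetical.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing (illustrative only)."""

    def __init__(self, docs):
        self.bigrams = Counter()   # counts of (previous token, current token)
        self.contexts = Counter()  # counts of each token used as a context
        self.vocab = {"</s>"}
        for tokens in docs:
            padded = ["<s>"] + tokens + ["</s>"]
            self.vocab.update(tokens)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, tokens, vocab_size):
        """Log-probability of a token sequence under the smoothed model."""
        padded = ["<s>"] + tokens + ["</s>"]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(tokens, models, priors):
    """Return the label maximising log P(label) + log P(tokens | label)."""
    shared_vocab = set().union(*(m.vocab for m in models.values()))
    vocab_size = len(shared_vocab) + 1  # one extra slot for unseen tokens
    scores = {
        label: math.log(priors[label]) + model.log_prob(tokens, vocab_size)
        for label, model in models.items()
    }
    return max(scores, key=scores.get)


# Toy training data (made up for this sketch, not drawn from LingSpam).
spam_docs = [["buy", "cheap", "pills", "now"], ["cheap", "pills", "online"]]
ham_docs = [["meeting", "agenda", "for", "monday"],
            ["see", "you", "at", "the", "meeting"]]

models = {"spam": BigramLM(spam_docs), "ham": BigramLM(ham_docs)}
priors = {"spam": 0.5, "ham": 0.5}
```

Calling `classify(["cheap", "pills"], models, priors)` scores the message under both class models and returns the better-scoring label. Using a shared vocabulary size for smoothing keeps the two models' probabilities comparable.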
Similar resources
An Adaptive, Semi-Structured Language Model Approach to Spam Filtering on a New Corpus
Motivated by current efforts to construct more realistic spam filtering experimental corpora, we present a newly assembled, publicly available corpus of genuine and unsolicited (spam) email, dubbed GenSpam. We also propose an adaptive model for semi-structured document classification based on language model component interpolation. We compare this with a number of alternative classification mod...
An Adaptive Approach to Spam Filtering on a New Corpus
Motivated by the absence of rigorous experimentation in the area of spam filtering using realistic email data, we present a newly-assembled corpus of genuine and unsolicited (spam) email, dubbed GenSpam, to be made publicly available. We also propose an adaptive model for semi-structured document classification based on smoothed n-gram language modelling and interpolation, and report promising ...
Single-Class Learning for Spam Filtering: An Ensemble Approach
Spam, also known as Unsolicited Commercial Email (UCE), has become an increasingly annoying problem for individuals and organizations. Most prior research has formulated spam filtering as a classical text categorization task, in which training examples must include both spam emails (positive examples) and legitimate mails (negatives). However, in many spam filtering scenarios, obtaining legitimate ...
Blocking Blog Spam with Language Model Disagreement
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam ...
Chinese Spam Filtering Based On Back-Propagation Neural Networks
As email becomes an important means of communication over the network, the volume of spam increases every day. This paper describes a new content-based email filtering model that uses Back-Propagation Neural Networks (BPNN). For Chinese email, it uses the Natural Language Processing & Information Retrieval Sharing Platform (NLPIR) system to perform Chinese word segmentation. The simula...